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DIRECT  SEARCH  METHODS  ON  PARALLEL  MACHINES 

J.E.  DENNIS,  JR.  AND  VIRGINIA  TORCZON  » 


Abstract.  This  paper  describes  an  approach  to  constructing  derivative-free  algorithms  for 
unconstrained  optimization  that  are  easy  to  implement  on  parallel  machines.  A  special  feature  of 
this  approach  is  the  ease  with  which  algorithms  can  be  generated  to  take  advantage  of  any  number 
of  processors  and  to  adapt  to  any  cost  ratio  of  communication  to  function  evaluation. 

Numerical  tests  show  speed-ups  on  two  fronts.  The  cost  of  synchronization  being  minimal,  the 
speed-up  is  almost  Unear  with  the  addition  of  more  processors,  i.e.,  given  a  problem  and  a  search 
strategy,  the  decrease  in  execution  time  is  proportional  to  the  number  of  processors  added.  Even 
more  encouraging,  however,  is  that  different  search  strategies,  devised  to  take  advantage  of  additional 
(or  more  powerful)  processors,  may  actuaUy  lead  to  dramatic  improvements  in  the  performance  of 
the  basic  algorithm.  Thus  search  strategies  intended  for  many  processors  actually  may  generate 
algorithms  that  are  better  even  when  implemented  sequentially.  The  key  difference  is  that  the 
additional  processors  are  not  used  simply  to  enhance  the  performance  of  an  inherently  sequential 
algorithm;  they  are  used  to  spur  the  design  of  ever  more  ambitious — and  effective — search  strategies. 

The  algorithms  given  here  are  supported  by  a  strong  convergence  theorem,  promising  computa¬ 
tional  results  on  a  variety  of  problems,  and  an  intuitively  appealing  interpretation  as  multidirectional 
Une  search  methods. 

Key  words,  unconstrained  optimization,  direct  search  methods,  multidirectional  search,  parallel 
optimization,  Nelder-Mead  simplex  algorithm 

AMS(MOS)  subject  classifications.  65K05,  49D30 

1.  Introduction.  We  consider  the  nonlinear  unconstrained  optimization  prob¬ 
lem 


min  /(x) , 
xeiR"  v  ’ 


where  /  :  HI"  — IR.  We  do  not  use  any  derivatives  or  finite  differences  in  our  search 
schemes.  These  schemes  qualify  as  direct  search  methods  since  the  search  is  driven 
solely  by  function  information.  We  require  only  that  /  be  continuous  on  a  compact 
level  set  to  prove  convergence  for  these  methods;  however,  to  guarantee  convergence 
to  a  stationary  point  we  require  that  /  be  also  continuously  differentiable.  These 
convergence  results  will  be  discussed  further  in  §4. 

Originally,  our  interest  in  direct  search  methods  was  based  on  the  fact  that  there 
had  been  only  limited  progress  in  designing  effective  parallel  optimization  algorithms 
that  could  take  advantage  of  a  large  number  of  processors  over  a  fairly  wide  range 
of  problems.  Meanwhile,  current  predictions  suggest  that  to  achieve  teraflop  perfor¬ 
mance  by  the  end  of  the  century  will  require  machines  with  8000  to  32000  proces¬ 
sors.  Workstations  with  up  to  1000  processors  are  on  the  drawing  board.  What  is 
needed  is  algorithms  that  can  be  easily  scaled  to  accommodate  ever  larger  number  of 
processors — as  well  as  ever  more  powerful  processors.  We  believe  that  the  algorithms 
we  propose  in  this  paper  constitute  progress  in  this  direction. 

The  simplicity  of  direct  search  methods  suggested  to  us  that  they  might  be  easily 
adapted  to  a  parallel  computing  environment  precisely  because  they  would  be  more 
amenable  to  scaling.  A  survey  of  the  scientific  literature  also  revealed  that  at  least 
one  direct  search  method,  the  Nelder-Mead  simplex  algorithm  [15],  numbers  among 
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the  more  popular  optimization  methods  in  scientific  computing.  The  simplicity  of 
the  direct  search  methods  certainly  explains  much  of  their  popularity.  It  is  also  true 
that  a  lack  of  derivatives,  as  well  as  “noise”  in  the  function  values,  may  preclude 
the  use  of  methods  that  require  derivatives.  Thus  we  believe  that  if  we  can  use 
parallelism  to  improve  the  performance  of  direct  search  methods,  any  improvement 
will  be  of  immediate  interest  and  possible  use.  Recent  experiments  with  problems 
from  cancer  research  [1],  chemical  engineering  [8],  and  stability  analysis  for  matrix 
computations  [9]  have  convinced  us  and  our  users  that  the  approach  given  here  is  a 
valuable  addition  to  the  optimizer’s  toolkit. 

The  purpose  of  this  paper  is  to  describe  these  parallel  direct  search  methods  in 
the  context  of  our  earlier  work  and  then  to  give  some  preliminary  numerical  results 
that  indicate  the  potential  for  this  approach.  These  results  suggest  the  merit  in 
pursuing  the  further  development  of  parallel  direct  search  methods.  One  goal  is 
to  demonstrate  that,  in  the  context  of  direct  search  methods,  computing  additional 
function  values  at  each  iteration  can  reduce  the  elapsed  time  to  completion  of  an 
algorithm  in  a  parallel  computing  environment  and  may  actually  reduce  the  total 
number  of  function  evaluations  required  to  produce  an  acceptable  solution  since  the 
number  of  iterations  required  to  reach  a  satisfactory  solution  may  be  significantly 
reduced.  Thus,  even  though  we  are  doing  more  work  at  each  iteration  (by  computing 
more  function  values  at  each  iteration),  we  learn  so  much  more  about  the  function 
that  we  produce  significantly  better  iterates  and  thus  converge  in  far  fewer  iterations. 
The  net  effect  is  that  even  with  a  more  ambitious  search  strategy,  we  may  actually 
compute  fewer  total  function  values. 

Our  paper  is  organized  as  follows.  Section  2  outlines  the  basic  approach  we  have 
taken  to  develop  parallel  direct  search  methods,  as  well  as  gives  the  assumptions  we 
have  made  about  the  general  parallel  computing  environment.  Section  3  contains  a 
brief  description  of  direct  search  methods  and  the  reason  for  our  interest  in  them,  as 
well  as  a  comparison  of  our  approach  with  several  parallel  implementations  of  quasi- 
Newton  methods  for  solving  the  general  unconstrained  minimization  problem.  In  §4 
we  review  the  multidirectional  search  algorithm,  a  direct  search  method  that  forms  the 
basis  for  our  more  general  parallel  schemes.  We  also  state  the  applicable  convergence 
theorem  from  [21].  In  §5,  we  show  how  to  incorporate  the  basic  multidirectional  search 
algorithm  into  a  core  step  that  can  then  be  augmented  to  take  better  advantage  of  the 
available  computational  resources  while  retaining  the  convergence  properties  of  the 
basic  algorithm.  In  §6  we  outline  the  new  parallel  multidirectional  search  algorithms 
and  discuss  implementation  details.  In  §7,  we  give  some  preliminary  numerical  results 
that  demonstrate  speed-up  on  two  fronts  and  report  our  experience  with  some  “real” 
problems.  In  §8,  we  close  with  some  remarks  concerning  future  directions  for  research. 

2.  Approach.  The  approach  we  take  can  be  viewed  as  an  extremely  flexible 
multidirectional  line  search  method  that  can  be  easily  scaled  to  fit  the  number  of 
processors  available — regardless  of  the  size  of  the  problem  to  be  solved.  Complete 
use  of  the  available  processors  is  accomplished  in  two  ways.  First,  additional  search 
directions  are  introduced  systematically  in  an  order  that  in  some  sense  reflects  the 
likelihood  of  producing  descent.  In  addition,  the  simple  line  search  conducted  along 
each  direction  is  refined,  with  preference  given  to  those  directions  that  are  deemed 
more  likely  to  produce  descent.  The  work  involved  in  determining  the  search  direc¬ 
tions  and  steps  is  minimal;  we  do  not  need  to  solve  any  linear  systems  of  equations. 
Furthermore,  this  approach  involves  a  minimum  of  inter-processor  communication 
(i.e.,  synchronization)  and  places  few  restrictions  on  the  function  to  be  minimized.  In 
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particular,  there  is  no  need  to  assume  that  the  function  evaluations  are  expensive  in 
order  to  justify  the  overhead  introduced  by  the  parallelization.  This  added  flexibility 
is  significant.  If  the  function  evaluations  must  be  expensive  to  support  the  cost  of  the 
synchronization,  then  as  processors  become  ever  faster,  the  range  of  problems  that 
can  be  solved  efficiently  on  parallel  machines  becomes  ever  smaller.  This  limitation 
will  become  even  more  of  an  issue  as  the  number  of  processors  grows  since  the  cost  of 
access  to  remote  memory  will  become  even  greater.  As  we  shall  see,  one  of  the  most 
attractive  features  of  the  parallel  direct  search  methods  is  that  these  methods  allow 
us  to  stack  function  evaluations  on  each  processor  until  the  cost  of  the  computation 
balances  the  cost  of  the  communication. 

Throughout  this  paper  we  will  assume  that  n  <  p,  where  n  is  the  dimension  of 
the  problem  to  be  solved  and  p  is  the  number  of  available  processors.  While  it  is 
certainly  possible  to  use  these  algorithms  when  p  <  n,  the  more  interesting  results 
occur  when,  in  fact,  n  <C  p  (or,  more  accurately,  when  the  total  number  of  function 
values  computed  at  each  iteration  of  the  algorithm  is  significantly  larger  than  the 
dimension  of  the  problem  to  be  solved). 

The  results  we  will  report  in  §7  have  been  taken  from  an  implementation  on  an 
iPSC/860,  but  these  algorithms  can  be  adapted  to  any  sort  of  parallel  computing 
environment.  This  certainly  includes  either  distributed-memory  or  shared-memory 
multiprocessors.  However,  since  these  algorithms  are  both  small  and  flexible,  they 
also  are  amenable  for  use  on  transputers  or  even  on  a  network  of  computers  that  may 
or  may  not  have  different  performance  characteristics.  There  is  only  one  point  of  syn¬ 
chronization  so  that  modifications  to  suit  a  particular  parallel  computing  environment 
are  straightforward.  Furthermore,  the  information  we  require  for  the  synchronization 
is  so  small,  regardless  of  the  search  strategy  we  employ,  that  even  when  global  com¬ 
munication  or  some  other  form  of  access  to  remote  memory  is  relatively  expensive  it 
is  still  easy  to  choose  a  search  strategy  from  among  those  we  propose  for  which  this 
approach  is  viable.  We  have  concentrated  on  MIMD  machines  only  because  we  are 
ultimately  interested  in  solving  problems  where  the  function  values  are  themselves 
the  result  of  another  (expensive)  process,  for  instance  a  simulation  or  the  solution 
of  a  differential  equation.  We  choose  to  treat  the  function  evaluation  routine  as  a 
“black  box.”  However,  with  the  proper  restrictions  on  the  function  evaluation  rou¬ 
tine,  the  ideas  presented  here  could  also  be  implemented  on  SIMD  machines.  Thus 
we  have  an  approach  that  is  flexible  enough  to  be  of  use  in  a  wide  range  of  computing 
environments. 

3.  Background.  The  methods  given  here  belong  to  the  large  and  often-used 
class  called  direct  search  methods.  Direct  search  methods  are  characterized  by  the 
fact  that  they  do  not  use  derivatives.  Derivative-free  schemes  are  more  widely  ap¬ 
plicable  than  gradient  or  quasi-Newton  methods.  Of  course,  there  is  little  doubt 
that  derivatives,  when  they  are  available,  can  be  used  to  speed  up  the  average-case 
performance  of  nonlinear  optimization  algorithms,  but  sometimes  derivative  approxi¬ 
mations  are  simply  not  practical.  In  some  control  problems  the  objective  calculation 
is  so  expensive  that  finite  differences  are  undesirable  and  the  code  involves  so  much 
branching  that  automatic  differentiation  is  currently  out  of  the  question,  while  an 
expert  might  require  weeks,  or  even  years,  to  find  an  alternative  adjoint  approach  to 
derivative  approximation.  Derivative-free  methods  are  also  quite  robust  in  dealing 
with  objective  functions  that  can  be  evaluated  to  only  a  few  significant  digits,  which 
precludes  the  use  of  finite-difference  derivative  approximations,  regardless  of  expense. 

As  an  indication  of  how  popular  direct  search  methods  are  with  users,  one  need 
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only  consult  the  1989  Science  Citation  Index  [17]  which  lists  more  than  215  citations 
for  the  classic  Nelder-Mead  paper.  Both  the  number  of  citations  and  the  range  of 
journals  in  which  these  citations  occur  has  grown  every  year  since  the  paper  first  ap¬ 
peared  in  1965.  The  most  cited  paper  in  the  vast  literature  on  quasi-Newton  methods 
appears  to  be  the  1963  paper  of  Fletcher  and  Powell  [7],  which  popularized  what  is 
now  known  as  the  DFP  variable  metric  secant  update.  In  the  1989  Science  Citation 
Index  the  Fletcher-Powell  paper  has  114  citations.  As  a  further  indication  of  its 
popularity,  we  note  that  the  Nelder-Mead  simplex  algorithm  also  appears  in  most 
commercially  available  software  libraries.  For  instance,  Nelder-Mead  is  a  standard 
feature  of  such  packages  as  NAG,  IMSL,  and  Matlab. 

Given  the  popularity  of  the  Nelder-Mead  simplex  algorithm,  our  first  attempt 
at  a  parallel  direct  search  method  consisted  of  a  straightforward  implementation  of 
the  Nelder-Mead  simplex  algorithm  on  an  iPSC  hypercube  (now  referred  to  as  an 
iPSC/1).  While  we  computed  n  +  4  function  values  simultaneously  to  complete  all 
the  function  evaluations  that  could  possibly  be  required  during  the  course  of  a  single 
iteration,  the  algorithm  showed  only  a  speed-up  of  order  two,  regardless  of  either  the 
size  of  the  problem  or  the  number  of  available  processors.  Careful  examination  of 
the  behavior  of  the  sequential  algorithm  confirmed  that,  in  fact,  this  was  the  best 
we  could  expect  to  do  with  such  a  naive  parallel  implementation  since  the  sequen¬ 
tial  algorithm  typically  required  only  two  function  values  per  iteration.  Additional 
experimentation  led  to  two  interesting  results.  The  first  was  the  development  of  a 
new  direct  search  method,  which  we  call  multidirectional  search  [20],  that  forms  the 
basis  for  the  algorithms  discussed  here.  The  second  was  the  unexpected  discovery, 
during  our  numerical  testing,  that  the  Nelder-Mead  simplex  algorithm  can  converge 
to  nonminimizers  when  the  dimension  of  the  problem  becomes  large  enough  [20] .  This 
behavior  occurred  for  all  the  test  problems  we  used  from  the  More,  Garbow,  Hillstrom 
problem  set  [12]  where  the  dimension  of  the  problem  could  be  varied.  In  all  cases,  we 
started  with  a  regular  simplex  (i.e. ,  a  simplex  with  edges  of  equal  length)  from  the 
standard  starting  point  given  in  [12],  This  behavior  occurred  even  on  such  a  simple 
problem  as 

min  /(x)  =  xTx 
X€IR" 

for  n  >  16.  One  of  the  advantages  of  the  multidirectional  search  algorithm  is  that, 
unlike  the  Nelder-Mead  simplex  algorithm,  it  is  backed  by  convergence  theorems  that, 
our  numerical  testing  indicates,  are  borne  out  in  practice. 

Initially  our  implementation  of  the  basic  multidirectional  search  algorithm  on  a 
shared-memory  multiprocessor  suffered  from  the  fact  that  the  number  of  processors 
that  could  be  used  was  tied  to  the  dimension  of  the  problem.  The  challenge,  then, 
was  to  find  ways  to  effectively  use  all  available  processors. 

Our  approach  consists  of  embedding  the  original  multidirectional  search  algo¬ 
rithm  in  a  family  of  algorithms.  These  algorithms  have  a  very  appealing  interpre¬ 
tation  as  multidirectional  line  search  methods.  The  original  multidirectional  search 
algorithm  [20]  performs  a  rudimentary  line  search  along  n  search  directions,  hence 
the  dependence  of  the  use  of  processors  on  the  dimension  of  the  problem.  Our  new 
approach  expands  upon  this  idea  by  systematically  introducing  line  searches  along 
new  search  directions  while  further  refining  the  line  search  along  the  existing  set  of 
search  directions.  As  we  shall  see,  this  approach  is  extremely  flexible  so  that  even  for 
a  problem  of  a  given  dimension  on  a  machine  with  a  fixed  number  of  processors  there 
is  a  family  of  closely  related  algorithms,  any  one  of  which  could  be  implemented. 
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The  idea  for  using  all  available  processors,  or  scaling  the  algorithm  to  fit  the  prop¬ 
erties  of  a  given  machine,  is  an  idea  that  we  have  seen  in  other  parallel  optimization 
algorithms.  The  key  difference  is  that  we  do  not  use  additional  processors  simply  to 
enhance  the  performance  of  an  inherently  sequential  algorithm. 

A  Newton  or  quasi-Newton  method  is  essentially  a  sequential  algorithm  that  con¬ 
sists  of  two  distinct  phases.  First,  a  single  search  direction  is  constructed  and  then 
a  step  that  satisfies  some  notion  of  sufficient  decrease  in  the  objective  function  value 
is  determined.  These  methods  are  inherently  sequential  since  a  step  for  the  current 
iteration  cannot  be  determined  until  after  the  search  direction  has  been  constructed, 
while  the  search  direction  for  the  next  iteration  cannot  be  ascertained  until  a  successful 
step  has  been  found.1  There  can,  however,  be  a  significant  amount  of  work  associated 
with  each  phase  of  the  iteration.  Thus,  efforts  to  produce  general,  parallel  Newton 
or  quasi-Newton  methods  have  concentrated  on  parallelizing  the  work  involved  in 
each  phase  [6,  3,  4,  13,  14].  These  efforts  have  been  successful  at  accelerating  the 
performance  of  Newton  or  quasi-Newton  methods  while  preserving  their  convergence 
properties.  However,  the  inherently  sequential  nature  of  Newton’s  method  limits  the 
number  of  processors  that  can  be  successfully  employed  since  the  fundamental  algo¬ 
rithm  remains  essentially  unchanged.  These  limitations  arise  in  two  ways.  Either  a 
fairly  small  number  of  processors  (i.e. ,  no  more  than  twenty)  can  be  used  to  parallelize 
the  linear  algebra  involved  at  each  iteration,  but  this  requires  the  assumption  that  the 
dimension  of  the  problem  is  quite  large  so  as  to  offset  the  cost  of  the  synchronization. 
Alternatively,  the  processors  can  be  used  to  calculate  finite-difference  approximations 
to  the  gradient  and  either  part  or  all  of  the  Hessian.  Then,  for  a  fixed  number  of  pro¬ 
cessors,  the  range  of  the  problems  that  can  be  solved  is  limited,  both  by  the  dimension 
of  the  problem  (i.e.,  0(n)  <  p  <  0(n2),  where  p  is  the  number  of  processors  and  n 
is  the  dimension  of  the  problem)  and  by  the  relative  cost  of  the  function  evaluations, 
again  to  offset  the  cost  of  the  synchronization. 

Our  approach  is  quite  different.  Given  the  dimension  of  the  problem  to  be  solved, 
the  number  of  available  processors  and,  ideally,  some  notion  of  the  relative  expense 
of  the  function  evaluations  to  communication  (or  synchronization)  costs,  we  have  a 
simple  initialization  scheme  that  tailors  the  basic  multidirectional  search  algorithm  to 
fit  these  specifications.  The  result  is  not  just  that  we  produce  a  different  sequence  of 
iterates.  We  actually  produce  different  algorithms  with  different  performance  char¬ 
acteristics.  The  numerical  results  in  §7  suggest  that  we  often  generate  better  direct 
search  algorithms.  This  improved  performance  is  not  accidental.  We  compute  more 
information  about  the  function — more  than  we  could  easily  justify  in  a  sequential 
computing  environment — but  we  use  all  of  the  information  we  compute.  Not  too 
surprisingly,  we  often  construct  a  better  sequence  of  iterates. 

Before  proceeding  to  a  description  of  the  multidirectional  search  algorithm,  we 
note  that  this  is  not  the  only  direct  search  method  we  could  use  as  a  core  step  for  our 
more  general  parallel  direct  search  schemes.  Our  investigation  of  the  theoretical  prop¬ 
erties  of  the  multidirectional  search  algorithm  revealed  that  several  well-established 
sequential  direct  search  methods,  such  as  the  original  factorial  design  algorithm  of 
Box  [2]  or  the  pattern  search  algorithm  of  Hooke  and  Jeeves  [10],  also  have  the  same 

1  Byrd,  Schnabel  and  Shultz  [3,  4]  anticipate  the  search  direction  for  the  next  iteration  by  cal- 
culating,  speculatively,  the  gradient  associated  with  the  trial  step.  The  calculation  is  speculative  in 
the  sense  that  the  outcome  of  the  trial  step  determines  whether  or  not  their  guess  was  correct  and 
thus  whether  or  not  the  information  they  have  already  computed  can  be  used  to  calculate  the  new 
search  direction. 
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essential  characteristics  and  thus  are  amenable  to  both  the  convergence  analysis  given 
in  [21]  and  the  general  parallelization  strategy  given  here  in  §5.  The  reason  we  have 
chosen  to  use  the  multidirectional  search  algorithm  as  our  basic  algorithm  is  that  it 
requires  only  0(n)  function  evaluations  per  iteration  to  guarantee  convergence.  The 
factorial  design  algorithm  also  falls  under  the  same  convergence  analysis  but  requires 
0(n 2)  function  evaluations  per  iteration.  Thus,  when  there  is  a  limited  number  of 
processors  available  (relative  to  the  size  of  the  problem),  the  multidirectional  search 
algorithm  is  more  effective.  Conversely,  when  the  number  of  processors  is  quite  large, 
it  is  easy  to  construct  examples  for  which  the  multidirectional  search  algorithm  and 
the  factorial  design  algorithm  of  Box  generate  the  same  generalized  direct  search  algo¬ 
rithm.  The  similarities  between  these  and  other  direct  search  algorithms  are  discussed 
briefly  in  [21];  a  more  detailed  discussion  is  being  prepared  for  publication. 

We  now  proceed  with  a  description  of  our  core  algorithm. 

4.  The  Basic  Multidirectional  Search  Algorithm.  An  iteration  of  the  ba¬ 
sic  multidirectional  search  algorithm  begins  with  a  simplex  S  in  IR",  with  vertices 
vo,  vi,  •  •  •  vn.  The  best  vertex  Vo  is  designated  to  be  a  vertex  for  which  /(v o)  <  /(vj) 
for  j  —  1,  •  •  • ,  n.  We  now  describe  a  complete  iteration  to  arrive  at  a  new  simplex  S+ . 

The  first  move  of  the  iteration  is  to  reflect  Vi,  •  •  • ,  vn  through  the  best  vertex  vo. 
Figure  1  shows  an  example  for  n  =  2.  The  reflected  vertices  are  labeled  rj  and  r2. 


Fig.  1.  The  three  possible  steps  given  the  simplex  S  with  vertices  (vo,Vi,V2) 

If  a  reflected  vertex  gives  a  better  function  value  than  the  best  vertex,  then  the 
reflection  step  is  called  successful  and  the  algorithm  tries  an  expansion  step.  The 
expansion  step  consists  of  expanding  each  reflected  edge  (rj  —  vo)  to  twice  its  length 
to  give  a  new  expansion  vertex  ej.  In  Fig.  1  the  expansion  vertices  are  labeled  ei  and 
e2. 

In  an  iteration  of  this  basic  algorithm,  the  expansion  step  would  be  tried  only 
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if  the  reflection  step  was  successful,  and  it  would  be  taken  only  if  some  expansion 
vertex  was  better  than  all  the  reflection  vertices.  Thus,  if  we  try  the  expansion  step, 
then  the  new  simplex  S+  is  either  the  expansion  simplex  ((vo,  ei,  62)  in  Fig.  1)  or  the 
reflection  simplex  ((vo,ri,r2)  in  Fig.  1). 

The  other  branch  of  the  basic  algorithm  is  the  case  where  the  reflection  step  was 
unsuccessful,  i.e. ,  no  reflection  vertex  has  a  better  function  value  than  /(v 0).  In  this 
case,  we  take  S+  to  be  the  contraction  simplex  formed  by  replacing  each  vertex  of  the 
worst  n-face  in  the  original  simplex  by  the  point  midway  from  it  to  the  best  vertex. 
Thus,  in  Fig.  1,  the  contraction  step  takes  S+  to  be  (vo,ci,C2). 

To  complete  one  iteration  of  the  basic  algorithm,  we  take  vj  to  be  the  best  vertex 
of  5+. 

Before  giving  the  convergence  result,  we  point  out  the  line  search  flavor  of  the 
algorithm.  In  the  case  n  —  1,  we  first  try  a  step  of  a  given  length  away  from  the 
vertex  with  the  larger  function  value  (the  reflection  step).  If  that  is  successful,  then 
we  try  a  longer  step  (the  expansion  step),  but  if  it  is  not,  then  we  try  a  step  only 
half  as  long  in  the  other  orientation  (the  contraction  step).  If  none  of  the  steps  give 
decrease,  then  the  next  iteration  begins  with  a  step  in  the  same  direction  as  the 
previous  iteration  began  with,  but  only  half  as  long.  Thus,  an  unsuccessful  sequence 
of  iterations  generates  a  backtracking  line  search  with  alternating  orientations.  (See 
Fig.  2.)  This  simple  observation  forms  an  important  part  of  the  convergence  proof. 


ri 


,.++ 


Cl 


Vo 


Fig.  2.  A  simple  backtracking  line  search 


Vl 


The  following  notation  will  be  used  in  the  statement  of  the  convergence  result. 
Let  { vg  }  be  the  sequence  of  best  vertices.  Let  vg  be  the  best  vertex  of  the  simplex 
So  used  to  start  the  algorithm.  Define  the  level  set  of  /  at  vg  to  be 

£(vg)  =  (x  :  /(x)  <  /(vg)}  . 

Given  y  E  IRn,  let  the  contour  C(y)  be 

C(y)  =  {x:/(x)  =  /(y)}. 

Let  X ,  be  the  set  of  stationary  points  of  the  function  /  in  L(vg). 

Theorem  4.1.  Assume  that  L(vg)  is  compact  and  that  f  is  continuously  differ¬ 
entiable  on  L(vg).  Then  some  subsequence  of  {vg}  converges  to  a  point  x*  EX,. 
Thus,  {vg}  converges  to  C,  —  C(x*)  in  the  sense  that 


lim 

fc— ►  OO 


inf 

xec. 


=  0. 


The  assumption  that  /  is  continuously  differentiable  on  L(vg)  can  be  reduced 
to  the  assumption  that  /  is  continuous  on  L(vg);  however,  the  set  X,  must  then 
be  expanded  to  include  all  points  where  the  function  /  is  nondifferentiable  on  L(vg) 
and  where  the  gradient  of  /  exists  but  is  not  continuous.  The  proof  for  both  results 
is  given  in  [21].  Before  we  go  on  to  extend  the  multidirectional  search  algorithm  to 
generate  a  family  of  parallel  algorithms,  let  us  discuss  the  proof  in  a  form  that  extends 
to  the  algorithms  of  the  next  section. 
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First,  we  see  why  the  algorithm  cannot  stall  at  a  Vo  with  a  nonzero  gradient. 
Notice  that  the  edges  of  S  adjacent  to  Vo  form  a  basis  for  IR"  and  so  at  least  one 
edge  is  not  orthogonal  to  V/(vo).  Thus,  either  that  edge  or  its  reflection  is  a  descent 
direction  from  Vo-  Let  us  call  this  a  descent  edge.  Now,  the  only  way  the  algorithm 
can  stay  at  vo  is  to  take  an  infinite  sequence  of  successive  contraction  steps.  However, 
at  the  next  iteration,  the  contraction  simplex  flips  to  the  opposite  orientation  to  form 
the  reflection  simplex  and  so,  along  any  descent  edge  from  Vo,  we  are  generating  a 
pair  of  sequences  of  points,  one  from  each  orientation,  halving  the  distance  from  Vo  at 
each  term  of  the  sequence,  as  seen  in  Fig.  2.  Therefore,  we  will  eventually  get  either 
a  successful  reflection  step  or  a  contraction  step  that  replaces  vo. 

Thus,  the  algorithm  can  be  viewed  as  a  backtracking  line  search  method  where  at 
least  one  of  the  n  search  directions  is  guaranteed  to  produce  descent  if  V/(v0)  0. 

Convergence  would  be  clear  if  some  principle  of  sufficient  decrease  on  /  were  required. 
However,  we  accept  a  new  best  vertex  based  only  on  simple  decrease.  The  remainder 
of  the  proof  consists  of  an  unusual  argument  by  contradiction.  We  assume  that  the 
sequence  of  best  vertices  stays  uniformly  bounded  away  from  the  set  of  stationary 
points.  Using  this  assumption,  along  with  the  compactness  of  the  level  sets,  the 
uniform  linear  independence  of  the  search  directions,  and  the  continuity  of  V /,  we 
can  show  that  all  but  a  finite  number  of  vertices  generated  by  the  algorithm  must  be 
contained  in  a  compact  set  and  lie  on  a  lattice.  Now,  the  first  part  of  the  proof  showed 
that  if  the  best  vertex  is  not  a  stationary  point  of  the  function,  then  the  algorithm  will 
produce  a  strictly  monotonically  decreasing  sequence  of  function  values.  The  second 
part  of  the  proof  demonstrates  that  under  the  hypothesis  that  the  sequence  of  best 
vertices  stays  uniformly  bounded  away  from  the  the  set  of  stationary  points  there  is 
only  a  finite  number  of  function  values.  Therein  lies  the  contradiction.  Thus,  our 
hypothesis  cannot  hold  and  we  have  convergence  to  the  set  of  stationary  points. 

We  close  by  noting  that  as  long  as  we  preserve  the  backtracking  line  search  fla¬ 
vor  of  the  algorithm,  the  proof  for  the  basic  algorithm  will  extend  to  the  parallel 
multidirectional  search  algorithms  we  propose. 

5.  The  Parallel  Multidirectional  Search  Algorithms.  The  strategy  we  will 
employ  to  define  a  family  of  direct  search  algorithms  is  very  simple.  We  will  look 
ahead  to  subsequent  iterations  of  the  algorithm  until  we  generate  a  sufficient  number 
of  vertices  to  keep  all  available  processors  busy. 

We  begin  by  removing  all  the  branching  from  the  basic  algorithm  to  obtain  a 
core  step.  Thus,  the  core  step  consists  of  the  union  of  the  reflection,  expansion,  and 
contraction  steps  from  the  basic  algorithm.  This  core  step  will  require  3 n  independent 
function  values  at  each  iteration.2  Thus,  in  the  two-dimensional  example  given  in 
Fig.  1,  the  core  algorithm  computes  the  function  values  at  the  six  new  vertices  ri,  r2, 
ei,  e2,  Ci,  and  C2  simultaneously.  We  then  choose  as  vj  the  vertex  that  produces  the 
best  function  value  while  S+  is  taken  to  be  the  simplex  that  produced  vj.  If  /(v 0) 
is  still  the  least  function  value,  then  S+  must  be  the  contraction  simplex. 

There  are  two  points  to  be  made.  First,  with  the  stipulation  that  the  contraction 
simplex  must  be  accepted  when  the  core  step  does  not  produce  a  new  best  vertex,  the 
convergence  theorem  still  holds.  Second,  without  the  branching  present  in  the  basic 
algorithm,  we  may  actually  produce  a  different  sequence  of  iterates.  For  example,  our 
choice  of  S+  might  now  be  the  expansion  simplex  even  if  it  would  never  have  been 

2  We  will  assume  for  now  that  the  number  of  processors,  p ,  is  greater  than  3 n.  As  we  shall 
demonstrate  in  §6.3,  it  is  possible  to  remove  this  restriction  on  the  minimum  number  of  processors 
required.  In  fact,  as  we  shall  see  in  §7  this  approach  may  generate  better  sequential  algorithms. 
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constructed  in  the  basic  algorithm.  This  could  happen  if  min{/(ei),  •  •  • , /(e„)}  < 
/(v o)  <  min{/(ri),  •  •  • ,  /(r„)}.  Thus  we  have  a  different  algorithm  that  may  actually 
produce  a  different  sequence  of  iterates. 

We  have  now  used  3 n  processors  to  compute  the  3 n  new  vertices  and  their  asso¬ 
ciated  function  values.  To  take  advantage  of  more  processors,  we  simply  continue  the 
look-ahead  to  subsequent  iterations.  For  instance,  we  could  assume  that  the  reflec¬ 
tion  step  is  accepted  at  the  current  iteration,  in  which  case  one  of  the  n  new  vertices 
associated  with  the  reflection  simplex  will  become  vj.  Thus,  we  can  consider  the  new 
reflection  vertices  that  might  be  constructed  at  the  next  iteration  if  each  of  ri,  •  •  • ,  rn 
were  given  the  role  of  Vq\  (See  Fig.  3.)  We  can  continue  this  look-ahead  to  construct 
all  the  reflection  simplices  that  could  be  considered  at  the  next  iteration  if  any  of  ri , 
•  •  •,  rn,  ei,  •  •  •,  en,  ci,  •  ■  c„,  or  Vo  were  to  become  vj,  as  shown  in  Fig.  4.  There 

is  also  nothing  to  prevent  us  from  including  all  possible  expansion  and  contraction 
vertices  as  well,  as  seen  in  Fig.  5. 


FlG.  3.  The  core  step  with  reflection  steps  from  n  and  T2 

If  we  were  to  construct  all  the  new  vertices  shown  in  Fig.  5  and  compute  their 
associated  function  values  in  parallel,  then  we  would  have  effectively  completed  two 
iterations  of  the  basic  multidirectional  search  algorithm.  However,  the  numerical 
results  we  will  show  in  §7  suggest  that  in  fact  we  typically  do  much  better,  for  reasons 
we  will  now  explain. 

If  we  return  to  the  core  step,  we  see  that  all  the  vertices  we  constructed  lie  along 
n  directions  determined  by  the  n  edges  adjacent  to  Vo,  as  seen  in  Fig.  6.  We  can  view 
the  core  step  as  a  rudimentary  line  search  consisting  of  three  steps  along  each  of  n 
directions.  When  we  proceed  to  the  next  iteration  and  consider  all  possible  reflection 
simplices,  we  introduce  new  line  searches  but  we  also  further  refine  the  search  along  the 
original  n  directions,  as  can  be  seen  in  Fig.  7.  When  we  complete  the  full  look-ahead, 
Fig.  8  shows  that  for  the  example  we  have  constructed  we  have  introduced  five  new 
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line  searches,  each  consisting  of  three  steps,  while  simultaneously  adding  additional 
steps  along  the  two  original  search  directions  to  refine  the  line  search.  If  we  were 
to  continue  this  process  into  the  next  iteration  we  would  find  that  we  would  again 
introduce  new  line  searches,  we  would  begin  to  refine  the  search  along  the  directions 
introduced  in  the  previous  iteration,  and  we  would  further  refine  the  searches  along 
the  original  n  directions. 

Figure  8  suggests  that  we  have  a  true  multidirectional  line  search  method  that 
scales  the  algorithm  by  introducing  new  line  searches  in  a  systematic  way.  The  original 
n  search  directions  are  deemed  most  likely  to  produce  descent  since  we  search  along 
edges  from  vertices  with  higher  function  values  towards  a  vertex  with  a  lower  function 
value.  In  fact,  the  convergence  theorem  guarantees  that  if  Vo  is  not  a  stationary  point, 
then  one  of  these  n  edges  will  produce  descent.  However,  we  hedge  our  bets  by  also 
adding  new  less  likely  search  directions.  Throughout  this  process  we  continue  to  refine 
the  line  search  along  each  of  the  directions  we  have  introduced  with  priority  going  to 
the  more  likely  directions. 

We  prefer  to  interpret  our  direct  search  methods  as  multidirectional  line  search 
algorithms.  The  simplex  interpretation  is  useful  for  generating  and  programming  these 
algorithms,  as  we  shall  see  in  the  next  section;  however,  the  line  search  interpretation 
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FlG.  5.  The  core  step  with  one  complete  look-ahead 
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FlG.  6.  The  n  original  search  directions 

allows  us  to  pose  these  algorithms  as  gradient-related  methods,  which  helps  explain 
both  the  convergence  theory  and  the  performance  characteristics  of  these  algorithms. 

We  also  note  that  we  have  introduced  a  family  of  algorithms,  not  a  single  method. 
We  have  specified  a  scheme  for  introducing  search  directions  and  refining  line  searches, 
but  there  is  tremendous  flexibility  within  this  scheme.  In  the  example  we  have  shown 
in  Fig.  8,  we  have  introduced  33  new  vertices.  On  a  32  processor  machine  we  can 
choose  almost  any  subset  consisting  of  32  of  these  33  vertices  to  define  a  multi¬ 
directional  search  algorithm.  While  each  of  these  subsets  was  generated  using  the 
same  look-ahead  scheme,  each  subset  produces  a  distinct  algorithm  that  may  generate 
a  different  sequence  of  iterates  when  applied  to  identical  problems.  This  observation 
suggests  the  need  for  defining  strategies  to  specify  the  order  in  which  to  introduce 
search  directions  and  refine  line  searches.  The  only  limitation  we  impose  arises  from 
the  convergence  theorem:  we  must  ensure  that  the  backtracking  line  search,  which 
constructs  steps  along  both  orientations  of  the  search  edge,  is  preserved. 

Defining  a  strategy  is  not  difficult.  We  used  the  following  principles  to  design  our 
current  implementation  of  the  parallel  multidirectional  search  algorithms: 

•  We  construct  a  list  of  vertices  until  we  have  enough  vertices  to  assign  to  all 
the  processors. 

•  We  start  this  list  with  vo  as  the  seed.  We  consider  each  vertex  in  the  order  in 
which  it  was  added  to  the  list  and  generate  the  complete  core  step  associated 
with  that  vertex.  The  3 n  vertices  associated  with  a  complete  core  step  are 
then  added  to  the  bottom  of  the  list  of  vertices. 

•  We  give  precedence  to  the  reflection  step,  then  the  contraction  step,  and  then 
finally  the  expansion  step  when  adding  vertices  to  the  list. 

•  We  include  the  current  best  vertex,  with  reduced  edge  lengths,  only  as  part 
of  the  contraction  step  since  in  the  basic  algorithm  there  would  be  a  new  best 
vertex  if  we  accepted  either  the  reflection  or  expansion  steps. 

The  actual  algorithm  we  use  to  implement  this  strategy  is  discussed  in  more  detail  in 
the  next  section. 

There  are  certainly  other,  possibly  better,  strategies  for  generating  parallel  multi¬ 
directional  search  algorithms.  For  instance,  we  could  first  construct  all  the  reflection 
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vertices  associated  with  a  single  iteration  as  in  Fig.  4  and  then  all  the  contraction 
vertices,  etc.  We  could  also  have  mixed  strategies  that  allow  for  different  choices  de¬ 
pending  on  the  type  of  step  that  produced  decrease  in  the  previous  iteration.  These 
ideas  are  the  subject  of  future  research  and  will  be  discussed  further  in  §8. 

Another  important  point  to  note  is  that  once  we  have  specified  a  strategy  for 
generating  both  the  search  directions  and  the  steps,  every  vertex  can  be  represented 
as  a  fixed  linear  combination  of  vo  and  the  edges  adjacent  to  vo.  There  is  no  need 
to  regenerate  the  necessary  coefficients  at  every  iteration.  To  see  this,  consider  our 
example  in  Fig.  7.  Since  (vi  —  vo)  and  (v2  —  vo)  span  1R2,  each  of  the  new  vertices  can 
be  defined  as  the  sum  of  vo  and  a  linear  combination  of  (vj  —  v0)  and  (V2  —  vo).  If  we 
fix  these  coefficients,  and  then  vary  S,  computing  the  new  vertices  at  each  iteration 
of  the  search  reduces  to  computing  a  linear  combination  of  the  edges  adjacent  to  the 
new  best  vertex.  Thus  we  have  a  template  for  the  search  that  is  defined  by  our  choice 
of  strategies  before  the  actual  search  procedure  begins.  Again,  we  will  defer  further 
discussion  of  this  point  to  the  next  section. 

Another  advantage  of  the  static  initialization  scheme  is  that  it  allows  us  to  elimi¬ 
nate  duplicate  vertices  in  the  template.  The  reader  has  probably  already  noticed  that 
once  we  begin  to  look  ahead  to  subsequent  iterations  some  vertices  may  be  multiply 


14 


Direct  Search  Methods  on  Parallel  Machines 


15 


defined.  For  example,  in  Fig.  3,  ei  and  e2  are  redefined  by  the  new  reflection  simplices. 
However,  the  coefficients  necessary  to  construct  these  vertices  from  (vi  —  Vo)  and 
(v2  —  vo)  are  identical,  so  such  duplication  is  easy  to  detect  and  eliminate  during  the 
initialization.  The  only  other  issue  to  be  decided  is  which  simplex  will  be  associated 
with  a  multiply  defined  vertex,  information  that  is  needed  to  avoid  ambiguity  when 
defining  S+  in  the  event  that  this  vertex  becomes  vj .  We  resolve  this  issue  by  breaking 
ties  in  favor  of  the  first  simplex  to  define  the  vertex. 

Having  anticipated  some  of  the  major  points  to  be  addressed  in  any  implementa¬ 
tion  of  the  parallel  multidirectional  search  algorithms,  we  are  now  ready  to  turn  to  a 
more  detailed  discussion  of  our  current  implementation. 

6.  A  Distributed  Memory  Implementation.  We  begin  with  a  statement  of 
the  basic  algorithm,  shown  in  Table  1.  Each  of  the  p  processors3  constructs  one  vertex 
v;  and  its  function  value  fvt .  The  scalars  <Zj,  •  •  • ,  z,-  required  to  construct  the  vertex 
are  local  to  the  processor.  We  assume  that  the  bulk  of  the  computation  occurs  in 
the  evaluation  of  /(v,).  Note  that  we  have  a  single  program  running  on  each  of  the 
processors.  The  data,  in  the  form  of  the  scalars  required  to  compute  the  vertex  v,-, 
varies  on  each  processor.  Thus  we  have  a  single  program/multiple  data  model.  With 
the  appropriate  restrictions  on  the  function  evaluation  routine,  which  we  choose  to 
treat  as  a  “black  box,”  these  methods  could  also  be  extended  to  SIMD  machines. 

Given  an  initial  simplex  So  with  vertices  (vo,  Vx,  •  •  • ,  vn), 
initialize  template 

while  (stopping  criterion  is  not  satisfied)  do 
for  i  =  1 ,  •  •  • ,  p  do 

Vi  v0  +  a,(vj  -  v0)  H - h  Zi(v„  -  v0) 

fVi  <-  /(v.  ) 

end 

/ v,  *—  mini  {fvi }  /*  communication  */ 

update  simplex 
end 


Table  1 

The  Parallel  Multidirectional  Search  Algorithm 


When  each  of  the  processors  has  completed  its  function  evaluation,  there  is  a  single 
global  exchange  to  determine  the  least  function  value.  On  the  iPSC/860  we  can  exploit 
the  hypercube  connectivity  by  using  a  global  handshake  algorithm  in  which  each 
processor  exchanges  the  least  function  value  it  has  seen  with  its  nearest  neighbor.  If 
we  assume  a  hypercube  of  dimension  d,  once  each  processor  has  exchanged  information 
with  its  d  nearest  neighbors,  every  processor  has  fv*.  Notice  that  this  eliminates  the 
need  for  a  single  controlling  process  since  each  processor  can  also  test  for  convergence. 
To  adapt  this  algorithm  to  a  different  parallel  computing  environment,  one  need  only 
make  the  appropriate  modifications  at  this  single  point  of  synchronization  to  introduce 
the  appropriate  form  of  remote  memory  access. 


3  We  assume,  for  now,  enough  processors  to  compute  the  3 n  vertices  associated  with  a  core 
step.  As  we  shall  see  in  §6.3,  satisfying  this  requirement — even  when  we  are  working  on  a  single 
processor — is  straightforward. 
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6.1.  Update.  During  the  course  of  the  global  exchange,  we  pass  three  additional 
pieces  of  information:  the  vertex  v.  that  produced  fv„ ,  the  pointer  source  that  corre¬ 
sponds  to  the  vertex  in  S  that  produced  v„,  and  a  scalar  a*  that,  in  conjunction  with 
v*  and  source,  allows  us  to  construct  the  simplex  St  associated  with  v„.  (The  pointer 
source  and  the  scalar  a,-  associated  with  the  vertex  v;  are  local  to  each  processor  and 
are  assigned  during  the  initialization  of  the  template.)  With  this  additional  infor¬ 
mation,  updating  S  to  produce  S+  is  straightforward,  as  seen  in  Table  2.  Note  that 
before  we  update  S,  we  check  to  see  if  fv »  produced  the  strict  decrease  we  require.  If 
not,  then  Vo  is  still  the  best  vertex;  to  ensure  convergence,  we  reduce  the  lengths  of 
the  edges  in  the  simplex  S  by  the  contraction  factor  9  E  (0,  1). 

if  (/».  <  fva)  then 
vj  <-  v, 
a+  <—  q» 

else 

v+  -  Vo 

a+  <—  9 

source  <—  0 

endif 

for  j  =  0,  •  •  • ,  n  do 

V+  <-  vj  +  <*+(Vj  -  V source ) 

end 

swap(0, source) 

Table  2 

Updating  the  simplex 


6.2.  Initialization.  The  real  effort  in  defining  a  parallel  multidirectional  search 
algorithm  lies  in  the  initialization.  The  key  point  to  emphasize  is  that  this  initializa¬ 
tion  is  static.  A  strategy  is  defined  before  the  search  actually  begins.  The  strategy  can 
be  specified  as  a  template  that  is  fixed',  it  is  the  simplex,  and  not  the  template,  that 
varies  from  iteration  to  iteration.  The  template  is  possible  because  each  vertex  in  the 
search  scheme  can  be  represented  as  a  sum  of  vo  and  a  unique  linear  combination  of 
the  edges  adjacent  to  v0,  as  demonstrated  in  Fig.  9.  Furthermore,  once  we  have  vj, 
it  is  possible  to  construct  5+  from  S  with  just  two  additional  pieces  of  information, 
a+  and  source,  as  shown  in  Fig.  10. 

To  generate  the  coefficients  a; ,  •  •  • ,  z,-  we  need  to  construct  the  vertex  v,- ,  and  the 
associated  a,-  and  source  needed  to  construct  the  simplex  5+  should  v*  produce  the 
least  function  value,  we  use  the  simple  algorithm  shown  in  Table  3.  To  see  why  this 
algorithm  works,  we  begin  by  noting  that  each  of  the  core  reflection,  contraction,  and 
expansion  vertices  are  defined  as  follows: 

Tj  =  V0  +  (-l)(Vj  -  Vo), 

C j  =  Vo  +  (vj  -  v0) ,  and 

ej  -  v0  +  (~2)  (vj  -v0), 

for  j  =  1  ,■■■  ,n.  Note  that  these  definitions  vary  only  in  the  choice  of  the  scalars  —  1, 
i,  and  —2  associated  with  each  step.  We  also  know  that  we  can  construct  5+  from 


=  1  and  source  =  1 
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Given  the  reflection  factor  A  =  —  1,  the  contraction  factor  6  = 
and  the  expansion  factor  p  =  —2, 

/*  initialize  the  root  of  the  tree  by  adding  the  current  simplex  */ 
root  «—  0 
sourceroot  <—  0 
enroot  *  1 

coefficientsroot  <-  0 

i  <—  1 

repeat 

/*  generate  all  possible  new  best  vertices  given  the  current  simplex  */ 
/*  first  consider  all  the  new  reflection  vertices  */ 
for  j  =  0,  •  •  • ,  n,  j  ^  sourceroot 
sourcei  <—  j 

ati  \  *  aroot 

coefficientSi  <—  coefficienisroot 
coefficients^sourcei)  <—  coefficients^sourcei)  +  ft; 
coefficientSi(sourceroot)  *—  coefficientSi{sourceroot)  —  ai 
i  <—  i  +  1 
end 

/*  next  consider  all  the  new  contraction  vertices  */ 
for  j  =  0,  •  •  • ,  ?i 
sourcei  <—  j 

(%i  d  %  &root 

coefficientsi  *—  coefficientsroot 

coefficientsi(sourcei)  <—  coefficients^sourcei)  +  ai 
coefficieniSi(sourceroot)  <—  coefficientsi(sourceroot)  —  a; 
i  <—  i  +  1 
end 

/*  finally  consider  all  the  new  expansion  vertices  */ 
for  j  =  0 ,■■■  ,n,  j  zjz  sourceroot 
sourcei  +—  j 
ai  <  p  *  arooi 

coefficients {  <—  coefficients,. oot 
coefficientSi(sourcei)  < —  coefficientsi (sourcei)  +  a; 
coefficients^sourceroot)  <—  coefficients^sourceroot )  —  a,- 
i  <—  i  +  1 
end 

root  <—  root  +  1 

until  enough  points  have  been  generated 

Table  3 

Initializing  the  template 
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S,  vj,  a+,  and  source  as 

vt  =  vj  +  a+  (vj  -  vsource) , 

for  j  =  0,  ••■,«.  Thus,  given  vj,  a+,  and  source  we  can  construct  the  core  step 
associated  with  vj .  For  instance,  the  reflection  vertices  would  be 

=  v<f  +  (~1)(v/  -  v^) 

—  vo"  +  (~  1)  ((vo"  +  a+  (v;  ~  ^source))  ~  vo" ) 

=  vo"  +  (  — l)(a+)  (vi  —  V source) 

=  Vq"  +  (—  l)(a+)  ((VJ  —  Vo)  —  {vsourCe  ~  vo)) 

for  j  =  0,  -  ,n,  j  jL  source — which  is  exactly  the  representation  we  are  using  to 

construct  the  template. 

We  also  note  that  there  is  additional  work  to  be  done  once  the  list  of  coefficients 
has  been  generated  since  there  are  duplicate  coefficient  vectors.  We  simply  sort  the 
list  and  then  eliminate  duplicates.  (We  also  include  the  n  +  1  vertices  in  the  original 
simplex  when  we  check  for  duplicates  so  that  they  are  not  redefined.)  Preference,  in 
terms  of  the  a  and  source  associated  with  each  coefficient  vector,  goes  to  the  first 
definition.  Note  also  that  if  we  use  6  =  ^  and  p  =  —2  (standard  choices  for  algorithms 
of  this  sort),  then  we  can  scale  the  entire  procedure  by  an  appropriately  large  multiple 
of  2  and  generate  the  template  using  integer  arithmetic,  which  is  faster,  halves  the 
required  storage,  and  eliminates  round-off  error. 

6.3.  Stacking  Computation.  The  last  claim  we  must  substantiate  is  that  it  is 
easy  to  balance  the  cost  of  the  computation  versus  the  cost  of  the  communication, 
or  some  other  form  of  remote  memory  access.  When  we  first  began  testing  the  basic 
multidirectional  search  algorithm  on  a  shared  memory  multiprocessor,  memory  access 
completely  swamped  any  gains  to  be  seen  from  computing  function  values — at  least 
those  from  the  standard  test  set — in  parallel.  One  way  to  overcome  this  imbalance 
would  be  to  assume  that  the  function  evaluations  are  expensive  enough  to  justify 
the  use  of  a  parallel  machine.  However,  this  imbalance  is  even  more  acute  on  the 
iPSC/860.  If  we  had  to  assume  the  function  evaluations  are  expensive  to  justify  using 
a  parallel  machine,  then  as  processors  become  ever  faster,  the  class  of  problems  we 
would  be  able  to  consider  solving  on  a  parallel  machine  would  become  ever  smaller. 

The  advantage  of  the  modification  we  now  introduce  to  the  parallel  direct  search 
schemes  is  that  while  it  introduces  more  computation  on  the  individual  processors, 
this  additional  computation  need  not  have  an  appreciable  effect  on  the  execution 
time  of  the  algorithm.  We  are  simply  trying  to  balance  the  cost  incurred  due  to 
the  global  communication  calls.  Our  modification  is  simple.  Rather  than  assuming 
that  the  function  evaluations  are  expensive,  we  simply  construct  more  vertices  on 
each  processor  and  compute  their  associated  function  values  before  we  synchronize 
the  search.  This  simple  modification  is  given  in  Table  4.  As  we  shall  see  in  the  next 
section,  this  “extra”  work  is  not  wasted;  in  fact,  it  may  actually  lead  to  a  significant 
decrease  in  the  total  execution  time  of  the  algorithm  since  the  more  ambitious  search 
strategy  that  results  may  lead  to  significantly  fewer  iterations. 

We  now  have  all  the  ingredients  we  need  to  claim  that  given  the  number  of 
processors,  the  dimension  of  the  problem,  and  some  ratio  of  the  cost  of  communication 
to  the  cost  of  computing  the  function  value,  we  can  tailor  a  parallel  multidirectional 
search  scheme  to  solve  the  particular  problem  in  the  given  computing  environment. 
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Given  an  initial  simplex  So  with  vertices  (vo,  vj,  •  •  • ,  v„), 
initialize  template 

while  (stopping  criterion  is  not  satisfied)  do 
for  i  =  1 ,  •  •  • ,  p  do 
for  j  =  1 ,  •  •  • ,  k  do 

v‘  <-  v0  +  a'-  (v i  -  v0)  + - b  Zj  (vn  -  v0) 

/»}  <-/(vj) 

end 

fvi  <-  min,-  {fv)  } 
end 

fv,  <—  min,-  {fvi}  /*  communication  */ 

update  simplex 
end 


Table  4 

The  Stacked  Parallel  Multidirectional  Search  Algorithm 


Thus  we  effectively  remove  two  issues  that  have  plagued  the  parallelization  of  other 
optimization  algorithms:  the  dimension  of  the  problem,  n,  and  the  relative  expensive 
of  function  evaluations. 

The  remaining  question  is  then:  How  well  do  these  algorithms  perform?  We  turn 
to  the  next  section  for  some  preliminary  numerical  results. 

7.  Numerical  Results.  We  wish  to  demonstrate  two  important  features  of  the 
parallel  direct  search  methods.  First,  these  algorithms  scale  almost  perfectly  in  the 
sense  in  which  “scale”  is  usually  applied  to  parallel  computation:  if  we  double  the 
number  of  processors  we  use,  we  essentially  halve  the  execution  time  of  the  algorithm. 
In  other  words,  we  have  almost  perfect  linear  speed-up  in  the  performance  of  the 
algorithm. 

However,  our  parallel  direct  search  algorithms  also  scale  in  a  way  that  is  not  so 
usual:  not  only  can  we  increase  the  number  of  processors,  we  can  also  increase  the 
number  of  points  we  compute  on  each  processor  before  we  synchronize  the  search 
by  making  a  global  communication  call.  This  is  the  stacked  parallel  multidirectional 
search  algorithm  given  in  Table  4.  When  we  fix  the  number  of  processors  and  increase 
the  number  of  points  we  compute  on  each  processor  we  are  defining  a  new  search 
strategy;  we  have  a  search  pattern  where  the  number  of  points  in  the  search  pattern 
is  equal  to  the  number  of  processors  times  the  number  of  points  computed  on  each 
processor  before  each  global  communication  call.  When  we  double  the  number  of 
points  in  the  search  strategy,  the  decrease  in  execution  time  can  be  dramatic.  This 
effect  on  the  execution  time,  as  we  shall  see,  is  highly  nonlinear;  with  the  right  choice 
of  strategies  one  may  even  have  a  better  sequential  algorithm  since  the  total  number 
of  function  evaluations  required  to  converge  to  a  solution  will  be  less,  even  if  one  were 
computing  more  function  evaluations  at  each  iteration. 

To  demonstrate  these  two  points,  we  will  “borrow”  the  level  curves  of  a  classic 
test  problem,  Rosenbrock’s  function  [16]: 

f(xi,x2)  =  100(a:2  -  x?)2  +  (1  -  xi)2  . 

We  have  borrowed  the  level  curves  because  Rosenbrock’s  function  is  a  function  that 
causes  most  optimization  problems  some  difficulty  when  the  search  begins  at  the 
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standard  starting  point  (-1.2, 1)  in  the  sense  that  fifty  iterations  is  fairly  typical  of 
better  methods. 

To  make  the  computation  more  realistic,  particularly  given  the  speed  of  the  pro¬ 
cessors  on  the  iPSC/860,  we  have  added  10,000  extra  floating  point  operations  to 
each  function  evaluation.  This  is  why  we  say  we  have  “borrowed”  the  level  sets  of 
Rosenbrock’s  function.  It  is  worth  noting  is  that  when  we  added  100,000  extra  float¬ 
ing  point  operations  to  each  function  evaluation  the  execution  times  did  increase  but 
the  observations  we  now  report  were  qualitatively  the  same.  We  chose  the  lesser  num¬ 
ber  of  extra  floating  point  operations  to  demonstrate  that  the  relative  expense  of  the 
function  evaluation  does  have  an  effect  on  the  efficiency  of  a  parallel  implementation. 

We  tested  our  parallel  direct  search  method  using  most  of  the  minimization  prob¬ 
lems  found  in  More,  Garbow,  and  Hillstrom  [12].  We  have  singled  out  Rosenbrock  s 
function  because  it  gives  fairly  typical  results;  we  observed  similar  behavior  when  we 
experimented  with  other  problems  in  the  More,  Garbow,  and  Hillstrom  test  set.  Later 
we  will  discuss  our  experience  with  some  “real”  problems. 

We  started  each  search  at  the  classic  starting  point  (—1.2,1).  Since  we  require 
a  simplex  to  start  the  search,  we  used  a  straightforward  procedure  found  in  [11]  to 
generate  a  regular  simplex  with  edges  of  length  one.  We  stopped  the  search  when  the 
absolute  value  of  the  function  (at  the  best  vertex  vo)  fell  below  10-7,  a  decrease  of 
eight  orders  of  magnitude.  There  is  nothing  special  about  this  choice;  the  algorithm 
ran  equally  well  for  choices  between  10_1  to  10~10.  (Note  that  the  stopping  test  we 
usually  employ  is  a  little  more  sophisticated  [20];  we  resorted  to  this  naive  choice  for 
simplicity  in  discussing  the  efficiency  of  the  algorithm  in  achieving  some  meaningful, 
well-defined  goal.) 

We  then  solved  the  problem  varying  two  parameters:  the  number  of  processors 
and  the  number  of  points  in  the  search  pattern.  The  results  can  be  seen  in  Figs.  11 
and  12. 

In  Fig.  11,  the  linear  speed-up  in  the  execution  time  is  apparent.  There  are  plots 
for  six  different  search  strategies  involving  search  patterns  of  32,  64,  128,  256,  512, 
and  1024  points.  We  have  plotted  the  log2  of  the  execution  time  in  milliseconds  as  a 
function  of  the  number  of  processors  we  have  used.  If  we  have  linear  speed-up,  then  as 
we  double  the  number  of  processors,  the  execution  time  should  be  halved.  The  solid 
lines  that  bracket  our  plots  in  Fig.  11  demonstrate  the  slope  we  should  see  for  perfect 
linear  speed-up.  And,  in  fact,  we  see  essentially  linear  speed-up  for  all  but  two  of  the 
plots.  The  plots  for  the  32  point  search  pattern  and  the  64  point  search  pattern  do 
show  some  degradation  in  the  speed-up  when  we  use  all  32  processors,  but  this  is  easily 
explained.  Even  with  over  10,000  floating  point  operations  per  function  evaluation, 
we  are  not  doing  enough  floating  point  operations  to  offset  the  overhead  incurred 
due  to  the  communication  if  we  are  only  doing  one  or  two  function  evaluations  per 
processor  before  each  global  communication  call. 

In  Fig.  12,  the  nonlinear  behavior  associated  with  the  algorithmic  changes  is 
apparent.  Here  there  are  plots  for  six  different  choices  in  the  number  of  nodes  used  to 
solve  the  problem.  We  have  plotted  the  log2  of  the  execution  time  in  milliseconds  as 
a  function  of  the  number  of  points  in  the  search  pattern.  For  this  particular  example 
the  best  choice  is  256  points.  Note,  however,  that  a  64  point  search  strategy  is  better 
than  a  128  point  search  strategy  and  a  512  point  search  strategy  is  better  than  a  1024 
point  search  strategy.  But  any  of  the  more  ambitious  strategies  takes  less  time  to 
execute  than  the  conservative  32  point  strategy — even  on  a  single  processor. 

The  “best”  choice  of  search  strategies  will  vary  depending  on  the  test  problem, 
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FlG.  11.  Linear  speed-up  obtained  by  doubling  the  number  of  processors. 
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FlG.  12.  Nonlinear  speed-up  obtained  by  doubling  the  number  of  points  in  the  search  pattern 
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the  starting  point,  the  size  and  orientation  of  the  initial  simplex,  etc.  It  is  clear  that 
there  is  a  certain  art  involved  in  choosing  the  “optimal”  number  of  points  to  be  used 
in  the  search  pattern.  However,  it  is  also  true  that  in  general  the  more  points  used 
in  the  search  strategy,  the  better  the  information  obtained  at  each  iteration  and  thus 
the  better  the  choice  of  new  best  vertices  is  likely  to  be.  Again  we  stress  that  this 
improvement  was  seen  across  all  the  problems  we  tested. 

Furthermore,  even  if  we  do  not  know  a  priori  which  search  strategy  is  “best” 
we  can  still  decrease  the  total  execution  time  simply  by  adding  more  processors. 
Returning  to  Fig.  12,  we  see  that  a  256  point  search  strategy  is  optimal,  regardless  of 
the  number  of  processors  we  use.  However,  any  of  the  search  strategies,  if  implemented 
on  at  least  eight  processors,  will  execute  at  least  as  quickly  as  the  256  point  search 
strategy  on  a  single  processor.  Thus,  even  if  we  do  not  know  the  “optimal”  search 
strategy,  we  can  still  expect  to  see  a  decrease  in  the  execution  time  simply  by  increasing 
the  number  of  processors  we  use.  If  we  wish  to  solve  a  problem  repeatedly,  it  may  be 
worthwhile  to  spend  some  time  identifying  an  “optimal”  search  strategy.  Otherwise, 
we  might  simply  use  as  many  processors  as  we  can  find  (or  afford). 

Given  the  speed  of  the  processors  on  the  iPSC/860  we  do  not  even  require  very 
many  processors  to  undertake  a  more  ambitious  strategy.  We  solved  a  parameter 
identification  problem  in  three  unknowns  that  models  catalytic  cracking  of  gasoil  to 
gasoline  [8]  in  just  over  two  seconds  using  a  32  point  search  strategy  on  eight  proces¬ 
sors.  Each  function  evaluation  required  the  numerical  solution  of  a  system  of  ordinary 
differential  equations.  Using  four  processors  with  the  same  32  point  search  strat¬ 
egy  took  over  four  seconds;  two  processors  required  between  eight  and  nine  seconds. 
Again,  we  observed  the  linear  behavior  with  respect  to  the  number  of  processors. 

Our  tests  of  the  parallel  direct  search  methods  lead  us  to  two  conclusions.  First, 
these  algorithms  do  scale  almost  perfectly  in  the  usual  sense:  as  long  as  we  require 
a  reasonable  amount  of  computation  on  each  processor,  the  communication  require¬ 
ments  are  so  minimal  that  we  see  almost  perfect  linear  speed-up  in  the  performance 
of  the  algorithm,  regardless  of  the  problem  we  solve.  If  we  double  the  number  of 
processors  we  halve  the  execution  time  of  the  algorithm. 

Another  result  is,  at  present,  less  well-understood  and  less  predictable.  If  we 
increase  the  number  of  points  in  the  search  strategy,  which  often  involves  a  significant 
increase  in  the  number  of  function  values  we  compute  (stack)  on  each  node  before  we 
attempt  to  synchronize,  we  may  actually  see  a  marked  decrease  in  the  total  execution 
time  of  the  algorithm.  These  improvements  have  been  dramatic  for  the  more  difficult 
problems. 

It  is  important  to  understand  that  our  parallel  direct  search  methods  do  noi 
place  an  upper  bound  on  the  number  of  processors  that  can  be  used.  Given  ever  more 
processors  we  expect  to  be  able  to  solve  ever  larger  problems.  We  also  note  that  much 
of  the  algorithmic  improvement  depends  on  the  proper  choice  of  a  search  pattern  and 
there  are  many  different  ways  to  generate  search  patterns,  even  for  a  fixed  number 
of  points.  As  we  gain  more  experience  with  these  methods  it  is  possible  that  if  we 
can  define  better  search  strategies  then  we  can  further  increase  the  range  of  problems 
that  can  be  solved  efficiently. 

Finally,  we  note  that  the  simplicity  of  the  parallel  direct  search  schemes  suggests 
that  they  are  very  useful  as  experimental  tools.  All  that  is  needed  is  a  function  eval¬ 
uation  subroutine.  While  one  is  calculating  the  derivatives,  coding  a  subroutine,  and 
then  debugging  the  code  to  use  a  more  sophisticated  optimization  method  on  a  “real” 
problem  where  the  solution  is  not  known,  it  is  possible  to  be  running  experiments 
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using  the  parallel  direct  search  methods  to  determine  reasonable  starting  points  for 
the  more  sophisticated  optimization  method,  as  well  as  getting  a  feel  for  the  general 
behavior  of  the  function.4 

The  simplicity  of  the  parallel  direct  search  methods  also  means  that  they  are 
less  likely  to  fall  prey  to  the  pathologies,  such  as  noise  or  the  lack  of  continuity,  that 
can  plague  more  sophisticated  optimization  methods.  As  further  evidence  of  this,  we 
point  to  the  success  reported  by  Higham  [9]  using  a  sequential  implementation,  in 
Matlab,  of  the  basic  multidirectional  search  algorithm  to  investigate  the  stability  and 
accuracy  of  algorithms  in  matrix  computations.  Higham  observes  that  many  of  the 
questions  of  interest  can  be  expressed  in  terms  of  some  easily  computable  function 
/.  The  catch  is  that  the  function  /  is  usually  not  smooth  and  derivatives,  when 
they  exist,  are  difficult  to  obtain.  Thus,  quasi-Newton  or  conjugate  gradient  methods 
are  not  applicable.  Direct  search  methods,  on  the  other  hand,  prove  to  be  useful 
experimental  tools. 

We  also  have  had  experience  experimenting  with  a  data  set  provided  by  re¬ 
searchers  at  the  University  of  Texas  at  Houston,  M.  D.  Anderson  Cancer  Center 
and  the  University  of  Texas  Medical  Branch  at  Galveston  who  are  trying  to  derive 
mathematical  models  for  the  predictive  value  of  early  CA125  serum  levels  in  epithelial 
ovarian  carcinoma  [1],  They  have  a  model  with  five  parameters  that  they  originally 
fitted  using  NL2S0L  [5].  They  were  concerned  that  NL2S0L  required  almost  900 
function  evaluations  to  return  a  solution.  Experimenting  with  our  parallel  direct 
search  methods,  by  successively  restarting  the  optimization  procedure,  we  were  able 
to  uncover  that  one  of  the  parameters,  which  is  required  to  be  strictly  greater  than 
zero,  was  tending  toward  zero  while  another  parameter,  which  is  unbounded,  was 
marching  steadily  towards  — oo — information  that  was  not  obvious  from  the  solution 
returned  by  NL2SOL.  They  are  now  interested  in  further  experiments  to  try  to  learn 
more  about  the  behavior  of  their  model. 

The  point  here  is  not  that  the  parallel  multidirectional  search  algorithms  produce 
optimal  schemes  for  all  problems  on  today’s  machines.  Rather,  when  the  problem 
to  be  solved  is  small,  but  difficult,  and  only  a  few  significant  digits  in  the  solution 
are  either  required  or  expected,  then  the  parallel  direct  search  methods  provide  to  be 
simple  and  surprisingly  effective  approach.  Furthermore,  these  methods  may  prove 
to  be  even  more  useful  as  experimental  tools. 

8.  Future  Work.  The  next  step  is  to  try  the  parallel  multidirectional  search 
schemes  on  an  inverse  problem  in  multidimensional  wave  propagation.  This  problem 
can  be  formulated  via  coherency  optimization  as  a  low-  (e.g.,  three-)  dimensional 
minimization  problem  [18,  19].  Currently,  an  implementation  of  the  objective  function 
on  a  Stardent  Titan  takes,  on  average,  several  hours  to  return  a  function  value.  There 
is  a  tremendous  amount  of  noise  in  the  data  so  that  the  function  values  cannot  be 
trusted  to  more  than  two  digits.  This  means  that  finite-difference  approximations  to 
the  gradient  are  really  not  feasible — and  that  an  answer  is  expected  to  be  accurate  to 
only  one  or  two  digits.  Thus,  a  quasi-Newton  method  seems  out  of  the  question  and 
a  direct  search  method  seems  to  be  in  order. 

Tackling  this  problem,  however,  means  that  we  will  need  to  rethink  the  original 

4  We  had  an  answer  for  the  parameter  identification  problem  less  than  fifteen  minutes  after 
receiving  the  code  for  the  objective  function.  We  spent  most  of  that  time  reading  the  accompanying 
instructions,  compiling  the  code,  and  then  entering  the  appropriate  data.  It  took  the  iPSC/860  just 
over  two  seconds  to  return  a  solution  on  our  first  try.  Writing  a  routine  to  evaluate  the  derivatives 
using  finite-differences  took  thirty  minutes  and  required  some  finesse.  The  solutions  were  equivalent. 
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implementation  of  our  parallel  multidirectional  search  schemes.  To  begin  with,  our 
current  implementation  is  best  suited  for  the  case  when  all  the  function  evaluations 
require  approximately  the  same  time  to  complete.  Thus,  there  is  a  natural  synchro¬ 
nization  that  allows  us  to  implement  the  algorithm  without  either  a  controlling  process 
or  any  concerns  for  load  balancing.  This  will  not  always  be  the  case  when  dealing 
with  more  difficult  problems.  Hence,  we  will  need  an  asynchronous,  task-queue  based 
implementation  with  a  single  controlling  process. 

Another  direction  of  research  would  be  to  extend  the  parallel  multidirectional 
search  algorithm  to  problems  with  constraints.  We  believe  it  is  possible  to  extend 
the  parallel  algorithms,  with  only  minor  modifications,  to  problems  with  bounded 
variables.  We  are  also  interested  in  handling  linear  constraints.  If  we  can  handle 
bounded  variables,  it  should  be  possible  to  transfer  many  of  the  ideas  learned  during 
the  development  of  interior  point  methods  to  our  simplex-based  method  for  handling 
problems  with  linear  constraints. 

There  are  several  other  ideas  we  would  like  also  to  pursue.  Although  we  have  a 
simple,  fast  algorithm  to  generate  templates  for  the  parallel  multidirectional  search 
schemes,  it  is  possible  that  there  are  other,  perhaps  better,  initialization  schemes  we 
could  implement. 

One  of  the  few  pieces  of  information  that  the  basic  multidirectional  search  algo¬ 
rithm  carries  from  iteration  to  iteration  is  the  size  of  the  step  taken  in  the  previous 
iteration — which  determines  the  size  of  the  step  taken  in  the  current  iteration.  If  an 
expansion  step  was  accepted,  this  would  indicate  that  the  simplex  is  still  far  from  a 
solution.  If  the  contraction  step  was  accepted,  then  either  the  simplex  is  near  a  solu¬ 
tion  or  it  is  trapped  in  a  difficult  region.  If  we  allowed  mixed  strategies,  i.e.,  different 
templates  depending  on  the  type  of  step  accepted  in  the  previous  iteration,  then  it 
seems  possible  that  we  could  further  accelerate  the  search.  This  is  an  idea  we  plan  to 
pursue  further. 

Currently  we  place  no  restrictions  on  the  simplex  used  to  start  the  procedure. 
The  default  option  generates  a  regular  simplex  (i.e.,  a  simplex  with  edges  of  equal 
length).  Another  standard  option  would  be  to  start  with  a  right-angle  simplex:  given 
a  starting  vertex  v0 ,  the  remaining  n  vertices  in  the  simplex  are  generated  using  the 
following  simple  formula 


Vj  • —  Vo  +  ctjej  , 

for  j  —  1,  •  •  • ,  n,  where  otj  is  a  nonzero  scalar  and  ej  is  the  standard  unit  coordi¬ 
nate  vector.  If  we  were  to  restrict  our  attention  to  right-angle  simplices,  then  we 
would  only  require  two  n  vectors  to  store  the  entire  simplex.  Furthermore,  produc¬ 
ing  the  trial  vertices  for  the  search  and  updating  the  simplex  would  be  reduced  from 
0(n2)  operations  to  O(n).  We  plan  to  adopt  this  restriction  when  we  implement  the 
asynchronous  version  of  our  algorithm. 

There  is  also  the  possibility  that  if  the  function  values  are  not  very  expensive  to 
calculate,  and  the  processors  are  very  fast,  then  the  number  of  function  evaluations 
we  would  need  to  stack  on  each  processor,  k,  could  also  become  quite  large.  However, 
as  we  noted  during  our  discussion  of  the  initialization,  the  template  is  generated 
using  integer  arithmetic.  The  information  is  then  rescaled  before  being  sent  out  to 
each  processor.  We  could,  instead,  store  the  template  in  its  integer  form,  which  would 
halve  the  storage  requirements,  and  simply  perform  the  scaling  required  each  time  the 
information  is  used.  A  choice  then  has  to  be  made  as  to  which  is  the  most  efficient 
option. 
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